Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: bam::Record:new should return a valid record #361

Merged
merged 13 commits into from
May 22, 2024

Conversation

nh13
Copy link
Contributor

@nh13 nh13 commented Jun 17, 2022

See #339.

@nh13 nh13 changed the title bam::Record:new should return a valid record fix: bam::Record:new should return a valid record Jun 17, 2022
@nh13
Copy link
Contributor Author

nh13 commented Jun 18, 2022

@jch-13 would you mind approving running the workfllow?

@dcroote
Copy link

dcroote commented Jun 18, 2022

CI run approved

src/bam/record.rs Outdated Show resolved Hide resolved
@nh13
Copy link
Contributor Author

nh13 commented Jun 18, 2022

All tests passing!

@coveralls
Copy link

Pull Request Test Coverage Report for Build 2520638230

  • 4 of 4 (100.0%) changed or added relevant lines in 1 file are covered.
  • 3 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.07%) to 94.827%

Files with Coverage Reduction New Missed Lines %
src/bam/record_serde.rs 1 87.31%
src/bcf/mod.rs 2 87.89%
Totals Coverage Status
Change from base Build 2296601963: 0.07%
Covered Lines: 11530
Relevant Lines: 12159

💛 - Coveralls

@dcroote
Copy link

dcroote commented Jun 18, 2022

I'm not the best to review this, but can a test be added to cover the issue this solves?

record.set_tid(-1);
record.set_pos(-1);
record.set_mpos(-1);
record.set_mtid(-1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that we should also set the record to be unmapped by default. I mean, how can it be mapped if the tid/pos are -1?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable but I'd quickly compare third party implementations (htslib/htsjdk/noodles) and/or spec footnotes (lots of gotchas there).

src/bam/mod.rs Outdated Show resolved Hide resolved
@nh13
Copy link
Contributor Author

nh13 commented Jun 29, 2022

@dcroote who is a good person to tag for review?

@brainstorm brainstorm self-requested a review June 29, 2022 23:01
@brainstorm
Copy link
Member

Happy to help you out Nils! Please fix the no-default-features compile errors and the clippy warnings, the fix looks good.

src/bam/mod.rs Outdated Show resolved Hide resolved
src/bam/mod.rs Show resolved Hide resolved
};
// The read/query name needs to be set as empty to properly initialize
// the record
record.set_qname(b"");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the writer would not write the record unless the query name is set to something (empty is ok)

@@ -113,12 +113,23 @@ fn extranul_from_qname(qname: &[u8]) -> usize {
impl Record {
/// Create an empty BAM record.
pub fn new() -> Self {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal in this PR is to that one does not need to call any other methods on Record to have a valid record. Currently, this requires setting the read name and alignment positions (and unmapped flag if the positions are -1). So this PR has an opinion, that a new record is unmapped with empty name/seq/qual/cigar.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for indirectly making me think about Default trait vs new(): https://users.rust-lang.org/t/when-to-use-default-trait/11190

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@holtjma
Copy link

holtjma commented May 17, 2024

Just ran into this issue and the above work-arounds resolved it. Any chance this makes it into the default anytime soon?

@brainstorm
Copy link
Member

brainstorm commented May 20, 2024

Just ran into this issue and the above work-arounds resolved it. Any chance this makes it into the default anytime soon?

@holtjma Happy to merge the changes if you fix the current test/CI issues? Changes look reasonable, apologies for letting this sit for >2 years, mostly using noodles nowadays :-!

@holtjma
Copy link

holtjma commented May 20, 2024

@brainstorm Took a quick peak this morning to see if I could figure out what was going on. The problem seems to be with the qname getting set and serde tests. When I comment out the record.set_qname(b"");, all of the errors go away. I tried digging into why, and had it print out the differences to stdout:

---- bam::record_serde::tests::test_serde_json stdout ----
encoded: [{"tid":0,"pos":1,"bin":4681,"mapq":1,"qname_len":4,"flag":16,"n_cigar":3,"seq_len":100,"mtid":-1,"mpos":-1,"isize":0,"data":[73,0,0,0,176,1,0,0,18,0,0,0,144,4,0,0,34,129,66,34,129,18,34,129,18,34,129,18,34,129,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,31,33,30,23,33,30,33,32,31,31,35,35,33,34,35,35,34,33,34,31,34,35,34,35,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,88,71,67,1,88,77,67,5,88,78,67,0,88,79,67,1,88,83,99,238,65,83,99,238,89,84,90,85,85,0]},{"tid":0,"pos":1,"bin":4681,"mapq":1,"qname_len":12,"flag":16,"n_cigar":3,"seq_len":100,"mtid":-1,"mpos":-1,"isize":0,"data":[73,73,46,49,52,57,55,56,51,57,50,0,176,1,0,0,18,0,0,0,144,4,0,0,34,129,66,34,129,18,34,129,18,34,129,18,34,129,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,31,33,30,23,33,30,33,32,31,31,35,35,33,34,35,35,34,33,34,31,34,35,34,35,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,88,71,67,1,88,77,67,5,88,78,67,0,88,79,67,1,88,83,99,238,65,83,99,238,89,84,90,85,85,0]},{"tid":0,"pos":1,"bin":4681,"mapq":1,"qname_len":4,"flag":16,"n_cigar":3,"seq_len":100,"mtid":-1,"mpos":-1,"isize":0,"data":[73,73,73,0,176,1,0,0,18,0,0,0,144,4,0,0,34,129,66,34,129,18,34,129,18,34,129,18,34,129,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,31,33,30,23,33,30,33,32,31,31,35,35,33,34,35,35,34,33,34,31,34,35,34,35,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,88,71,67,1,88,77,67,5,88,78,67,0,88,79,67,1,88,83,99,238,65,83,99,238,89,84,90,85,85,0]},{"tid":0,"pos":1,"bin":4681,"mapq":40,"qname_len":4,"flag":16,"n_cigar":3,"seq_len":100,"mtid":-1,"mpos":-1,"isize":0,"data":[73,86,0,0,176,1,0,0,18,0,0,0,144,4,0,0,34,129,66,34,129,18,34,129,18,34,129,18,34,129,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,31,33,30,23,33,30,33,32,31,31,35,35,33,34,35,35,34,33,34,31,34,35,34,35,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,88,71,67,1,88,77,67,5,88,78,67,0,88,79,67,1,88,83,99,238,65,83,99,238,89,84,90,85,85,0]},{"tid":0,"pos":1,"bin":4681,"mapq":1,"qname_len":4,"flag":16,"n_cigar":3,"seq_len":100,"mtid":-1,"mpos":-1,"isize":0,"data":[86,0,0,0,176,1,0,0,18,0,0,0,144,4,0,0,34,129,66,34,129,18,34,129,18,34,129,18,34,129,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,66,40,17,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,31,33,30,23,33,30,33,32,31,31,35,35,33,34,35,35,34,33,34,31,34,35,34,35,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,88,71,67,1,88,77,67,5,88,78,67,0,88,79,67,1,88,83,99,238,65,83,99,238,89,84,90,85,85,0]},{"tid":0,"pos":1,"bin":585,"mapq":1,"qname_len":4,"flag":2048,"n_cigar":3,"seq_len":100,"mtid":-1,"mpos":-1,"isize":0,"data":[86,73,0,0,176,1,0,0,2,106,24,0,144,4,0,0,18,129,20,34,129,20,34,129,20,34,129,20,34,17,136,24,36,24,136,40,65,17,17,24,129,130,65,24,136,130,129,65,17,136,136,66,17,24,136,136,130,24,17,17,136,24,36,24,136,129,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,31,33,30,23,33,30,33,32,31,31,35,35,33,34,35,35,34,33,34,31,34,35,34,35,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34]}]
[73] => [73, 0, 0]
[73, 73, 46, 49, 52, 57, 55, 56, 51, 57, 50] => [73, 73, 46, 49, 52, 57, 55, 56, 51, 57, 50]
[73, 73, 73] => [73, 73, 73]
[73, 86] => [73, 86, 0]
[86] => [86, 0, 0]
[86, 73] => [86, 73, 0]
thread 'bam::record_serde::tests::test_serde_json' panicked at src/bam/record_serde.rs:353:9:
assertion failed: `(left == right)`

It seems like the qnames are getting 0-padded for some reason, but I'm not entirely sure why. Then those extra zeroes are triggering the mismatch in the tests.

I think Nils is correct in that the qname should be set to the empty string, but it's not obvious to me what's going on with this padding or why that extra set_qname would trigger the padding to appear. Any ideas on how to resolve this or poke at it further?

EDIT: Unrelatedly, I need to go dig into noodles again. I've been meaning to try it on a smaller project when I have time to get familiar with the difference!

@holtjma
Copy link

holtjma commented May 20, 2024

@brainstorm

Update: I found a putative patch, but don't seem to have permission to change this PR/branch (I think this is restricted from @nh13). There's a one line addition to fix_l_extranul in record_serde.rs:

fn fix_l_extranul(rec: &mut Record) {
    // first, reset the number of extranuls to 0 for calling .qname(); then calculate how many we actually have
    rec.inner_mut().core.l_extranul = 0;
    let l_extranul = rec.qname().iter().rev().take_while(|x| **x == 0u8).count() as u8;
    rec.inner_mut().core.l_extranul = l_extranul;
}

The change resets extranul prior to calling qname, allowing it to get the correct full block with padding and then shrink it back down.

@nh13
Copy link
Contributor Author

nh13 commented May 21, 2024

@holtjma this comes from my fork, do you want to make a PR to my branch or would you recommend something else?

@holtjma
Copy link

holtjma commented May 21, 2024

I was trying to avoid the extra PR since it’s a one-line change, but I can do that tomorrow if it’s preferred.

@nh13
Copy link
Contributor Author

nh13 commented May 22, 2024

@holtjma I think I did the right thing

@brainstorm brainstorm merged commit 87f2011 into rust-bio:master May 22, 2024
8 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants